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Using RNA sequencing of triple-negative breast cancer (TNBC), non-TBNC and HER2 -positive breast cancer 
sub-types, here we report novel expressed variants, allelic prevalence and abundance, and coexpression with 
other variation, and splicing signatures. To reveal the most prevalent variant alleles, we overlaid our findings 
with cancer- and population-based datasets and validated a subset of novel variants of cancer-related genes: 
ESRP2, GBP1, TPP1, MAD2L1BP, GLUD2 and SLC30A8. As a proof-of-principle, we demonstrated that a 
rare substitution in the splicing coordinator ESRP2 (R353Q) impairs its ability to bind to its substrate FGFR2 
pre-mRNA. In addition, we describe novel SNPs and INDELs in cancer relevant genes with no prior reported 
association of point mutations with cancer, such as MTAP and MAGED1. For the first time, this study 
illustrates the power of RNA-sequencing in revealing the variation landscape of breast transcriptome and 
exemplifies analytical strategies to search regulatory interactions among cancer relevant molecules. 

Breast cancer is the third most frequent cancer in the world as it affects approximately one in ten women in 
the western world 1 . The initial knowledge that connected breast cancer to genetic susceptibility originated 
from the clinical observations that highlighted the clustering of breast cancer cases in families 2,3 . 
Approximately 5-10% of breast cancers are believed to result from the inheritance of rare genetic components 
that confer significantly elevated risk 4,5 . For example, mutations in the tumor suppressor genes BRCA1 and 
BRCA2 account for approximately 16% of the familial breast cancer 6-8 . The vast majority of breast cancer cases, 
however, are derived from a complex interaction between multiple environmental, lifestyle and genetic factors 
with relatively weak individual risk contribution 9,10 . 

While the effects of many environmental and lifestyle factors, such as diet, reproductive behavior and radiation 
are well appreciated, the knowledge on genetically contributing patterns is limited. Association studies have 
identified ATM, BRIP1, CASP8, CDH1, CHEK2, PALB2, PTEN, STK11, and TP53 as breast cancer susceptibility 
genes. Such mutations collectively account for 2.3% of familial risk of breast cancer, and together with BRCA1, 
BRCA2 and others have been implicated in high risk screening strategies 5,8,1120 . Nonetheless, significant propor- 
tion of the familial and non-familial breast cancer susceptibility remains unknown, suggesting plethora of genetic 
elements that need to be understood. 

Transcriptome sequencing comprises a unique interplay between individual genetic background, reflected in 
the variation content, and the epigenetic and environmental regulation affecting gene expression levels and splice 
patterns. Recent transcriptome sequencing efforts have highlighted important somatic events in metastatic triple 
negative breast cancer (TNBC) and described important for the clinical outcome genotype-phenotype correla- 
tions 21 . Further, transcriptome sequencing data have been successfully explored to reveal disrupted pathways in 
TNBC through genome-wide loss of heterozygosity and mono-allelic expression estimation 22 . As a result of these 
and other studies, the feasibility of transcriptome sequencing to uncover molecular mechanisms of breast cancer 
drivers is increasingly appreciated 23 . 
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Here we used whole transcriptome RNA-sequencing to reveal the 
variation signatures of 17 breast cancer patient tissues, and compared 
with human normal breast organoids (referred from here on as nor- 
mal breast tissue, NBT). The 17 samples include six TNBC, lacking 
expression of therapeutically significant components - estrogen 
receptor (ER), progesterone-receptor (PR) and the Human 
Epidermal Growth Factor Receptor 2 (HER2); six Non-TNBC (ER, 
PR and HER2-positive); and five HER2-positive samples (ER and PR 
negative). Compared to the extensively performed searches for so- 
matic breast cancer mutations, our RNA-sequencing based approach 
detects SNPs that are expressed at the mRNA level, and allows 
estimation of their allelic expression at nucleotide resolution. A set 
of novel variants were validated through Sanger sequencing. As a 
proof-of-principle, we have explored the effect of a rare SNP- 
p.R353Q - in the epithelial splicing regulatory protein ESRP2, on 
the binding and splicing of its target pre-mRNA. Our study reports 
a set of novel mutations in essential regulatory molecules in breast 
cancer and discusses their allelic preferential expression and poten- 
tial involvement in breast cancer. 

Results 

Analytical strategies and overall variation landscape. We set out to 

define the transcribed variation profile of TNBC, Non-TNBC and 
HER2-positive breast cancer samples. To achieve this, we applied 
mRNA sequencing on 17 breast cancer samples from unrelated 
individuals as well as on three NBT samples on the Illumina HiSeq 
2000 platform. The raw reads were aligned against Ensembl 
GRCh37.62 B (hgl9) using TopHat 24 , and the variants were called 
using Samtools 25 . Prior to filtering, a total of 1,876,617 SNPs, 331,197 
of which were novel, were called across all 17 breast cancer samples, 
and between 30,294 and 258,465 SNPs (average 1 10,389) were called 
in each individual sample (Supplementary Table 1). The overview of 
the workflow and the filtering strategy is presented in Figure 1. The 
SNP calls were separated into two groups - either reported in the 
databases (between 22,914 and 218,411 per sample, average 91,201), 
or novel. The previously reported SNPs, due to validation by at-least 
one independent group, were analyzed further without filtering. To 
increase the confidence in the calls of novel variants, we initially 
analyzed the SNP calling reads of 1,000 SNPs through Integrative 
Genomics Viewer (IGV) files, and 96 of the calls were tested by 
Sanger sequencing. Based on the findings of this pilot validation 
test, we set up filtering criteria retaining minimum false-positive 
and false-negative calls as follows: those supported by at least of 
three bidirectional reads with unique start position, minimum 
phred quality value of 20, mapping quality value (MQV) > 20, 
and presence in 3 or less different samples. To ensure that we were 
not missing any novel high prevalence SNPs among our samples, all 
the positions at which a novel SNP was called in 4 or more samples 
were visually examined through IGV before to be assigned as false 
positives - no novel SNPs called in 4 or more samples were identified. 
This filtering left us with between 60 and 1143 novel variants per 
sample (average 285). The transition to transvertion (Ts/Tv) ratio 
among the novel coding SNPs was 2.8 and aligns with previously 
reported values for human exome, thus increasing the confidence of 
our filtering algorithm 26,27 . 

Prior to filtering, between 1,574 and 11,669 previously reported 
INDELs were called in each of the studied breast cancer samples and 
subjected to further analysis (See Supplementary Table 1). The novel 
INDELs were quality filtered to remove calls with MQV less than 20, 
phred quality value below 20 and presence in three or more different 
samples. This left between 18 and 142 novel INDELs (average 59) per 
sample, which were retained in our further analysis. 

Expressed SNP density. To assess the overall expressed variation 
landscape of the breast cancer samples, we estimated the SNP 
density by counting the number of SNPs per megabase (MB) 



genome intervals. The SNP density was calculated individually for 
each sample and compiled per group (TNBC, non-TNBC and HER- 
2), and as a whole for the 17 samples (Figure 2). Overall, the SNP 
density distribution across the three groups was very similar, with a 
few regions showing group-specific high-density loci. All the TNBC 
samples presented with high number of SNPs in the region of 
chrl4:10500000-106000000, which was mainly contributed by 
increased overall SNP number in the large gene encoding 
nucleoprotein AHNAK2. Specifically enriched in all non-TNBC 
samples was the region on chrl9:53000000, mainly due to high 
number of SNPs in the zinc finger protein (ZNF) encoding genes. 

We also overlapped the expressed overall SNP density in our 
samples with somatic genome SNP density calculated from the 
COSMIC dataset. There was a significant overlap in the overall 
SNP distribution. However, the regions with highest SNP frequency 
differed: while in the COSMIC dataset they were chr2:48000000, 
chrl7:20000000 and chr5:72000000, the three top SNP-enriched 
regions when all of our samples were analyzed together were 
chr6:31000000-32000000, chr8:144000000 and chrl9:53000000. 
While the high density observed on chromosome 6 was due to a 
the well-known variability in the histone cluster (HIST1H1A) and 
major histocompatibility complex (HLA), the chromosome 8 region 
was enriched by variants in epidermal antigen Epiplakinl (EPPK1) 
and lymphocyte antigen (LY6E) (see Figure 2). 

Comparative analysis with cancer genome variations (COSMIC). 

We compared the SNPs identified in our samples with the COSMIC 
cancer genome somatic mutation database (http://www.sanger. 
ac.uk/cosmic) 2S . A total of 2,169 SNPs from the COSMIC database 
were found among our samples, 129 of which were present in more 
than 10 of the 17 samples, and 6 were called in all breast cancer 
samples. Only one SNP - the relatively common variant R1322X 
in the ABC transporter gene ABCA10 was nonsense, 515 were 
missense and 20 were located within a splice site. Of note, only 
two of the SNPs in our dataset, both UTR located, overlapped with 
COSMIC variants found in breast cancer: 1) the promoter T > C 
substitution in the proto-oncogene binding Yes-associated protein 
(YAP1) was seen in 8 of our 17 patients, and 2) the 3'-UTR C> T 
substitution in peptidylprolyl isomerase F (PPIP) was found in 5 of 
our samples. Among these comparisons, highly represented in our 
datasets were COSMIC missense variants in the DNA-repair 
encoding probable helicase senataxin SETX and Ewing's tumor- 
associated antigen ETAA1. 

GWAS associated SNPs in the breast cancer transcriptome. To 

outline SNPs that have been previously associated with breast 
cancer phenotypes, we overlaid our datasets with the publically 
available genome wide association studies (GWAS); the results are 
summarized in Supplementary Table 2. The pre-B-cell leukemia 
homeobox 1 (PBX1) intronic SNP rsl387389 that has been 
reported to strongly associate with early onset breast cancer 29 was 
present in 4/17 samples, two of which were homozygous. Similarly, 
two breast cancer associated SNPs in the fibroblast growth factor 
receptor (FGFR2) 30 , rs2420946 and rs2981582, were present each 
in two of our samples, (one patient was a carrier of both), again, in 
a homozygous state. Of note, the mitogen-activated protein kinase 
kinase kinase (MAP3K1) SNP reported by the same study was not 
present among our samples, however, we found a higher prevalence 
of the closely positioned D860N and V906I missense substitutions in 
MAP3K1; they were called in 13 (9 homozygous) and 16 (11 
homozygous) of our samples, respectively. Similar high homozy- 
gocity prevalence was seen for the rs704010 rs8170, rs2180341, 
rsl3281615, rs3817198 and rs4973768, but was not observed for 
the intergenic rs44 15084. Other SNPs reported to be in strong 
association with breast cancer from recent meta-analyses 29 33 were 
not seen in our samples. 
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Figure 1 | Workflow and filtering overview. Different filtering strategies were applied to the novel and the previously reported variants. The previously 
reported variants, due to validation by at-least one independent group, were analyzed further without filtering. The filtering criteria for the novel variants 
were set as follows: those supported by a minimum of three bidirectional reads with unique start position, minimum phred quality value of 20, 
mapping quality value (MQV) > 20, and presence in 3 or less different samples. All the positions at which a novel SNP was called in 4 or more samples 
were visually examined through IGV before assignment as false positives. 



Variations in genes previously implicated in hereditary breast 
cancer. To search for known predisposing breast cancer variants 
among our samples, we extracted SNPs and INDELS in genes that 
have been previously associated with hereditary breast cancer. 
Among all 17 samples, 80 SNP calls (38 unique SNPs) and 66 
INDEL calls (38 unique INDELs) mapped within ATM, BRCA1, 
BRCA2, BRIP1, CASP8, CDH1, CHEK1, PTEN, STK11 or TP53. 
While the majority of the SNPs called in those genes variants were 



common or have no known effect on the protein, several variants 
have been previously linked to breast cancer predisposition (Table 1). 
In BRCA1 and BRCA2 collectively, twelve different missense 
substitutions were identified in a total of nine patients. Five of the 
missense substitutions (p.Q356R, p.R496H, and p.D693N in BRCA1, 
and p.N289H and p.D1420Y in BRCA2) have been previously 
associated with breast cancer through either family or case-control 
studies 34,35 . Three patients from the non-TNBC group were carriers 
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Figure 2 | Expressed SNP density expressed as number of SNPs per megabase (MB) genome intervals. The SNP density was calculated individually for 
each sample and compiled per group (TNBC, non-TNBC and HER-2), and as a whole for the 17 samples. Overall, similar SNP density distribution is 
observed across the three groups. (A) Circos plot representing the high density expressed SNPs in TNBC, Non-TNBC and HER2 positive Breast Cancer 
Samples compared to cancer genome SNP data from COSMIC. (B) The highest SNP density loci for TNBC, Non-TNBC, HER2 and COSMIC, 
compared to the SNP density for the same locus for the other three groups. The highest SNP density for the COSMIC was observed in the interval 
chr2:48000000-48999999, containing the genes MSH6, FBX011, FOXN2, PPP1R21, STON1, GTF2A1L and LHCGR, while very low expressed SNP density 
for this region was measured in all three breast cancer subtypes. 

of the missense variant p.Q356R, and the other cancer- associated patient carried a small BRCA1 deletion (chrl7:41246251delC, 
variants were present in one patient each - from either non-TNBC rs80357794) that leads to a frame-shift and premature stop codon 
or HER2 positive groups. The two BRCA2 missense substitutions expected to completely abolish protein function. In the ATM gene, 
were seen in HER2-positive patients. In addition, one TNBC we identified the non-synonymous substitutions p.F858L and 
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Table 1 Variants in ATM, BRCA1, BRCA2 and STK1 1 identified in the 1 7 breast cancer samples (from HGMD, http://www.hgmd.cf. 
ac.uk/ac/index.php. The homo- or heterozygosity and the number of the unique variant and reference reads are also shown 

Chromosomal Cancer Cancer Var/ref 
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p.L1420F, which have been previously associated with increased risk 
for breast cancer 36 . The ATM missense p.F858L is known to impact 
the interaction of ATM with beta-adaptin, which is necessary for 
clatherin mediated receptor endocytosis and is proposed to 
contribute to the hereditary radio sensitivity and breast cancer 37,38 . 
In addition to breast cancer-associated variants, a missense 
substitution in STK11, p.H175Y, previously reported in a lung 
carcinoma 39 was been found in one patient. One non-TNBC 
patient carried simultaneously pathogenic variants in BRCA1 and 
ATM. Of note, overall a higher number of variant versus reference 
reads was assessed across BRCA1, BRCA2 and ATM variations. 

Prevalence of rare variants. To reveal variants that might be overre- 
presented in our samples compared to the general population, we 
compared the allele frequency of coding SNPs called in our samples 
against 11,666 alleles from the Exome Sequencing Project dataset 
(http://evs.gs.washington.edu/EVS/). To minimize error due to 
different variant calling platforms and to increase the statistical 
significance of the findings, we excluded from this analysis SNPs 
called in less than 10 alleles from the ESP dataset and in less than 3 
individuals among our 17 samples. SNPs called in all 17 of our 
samples were also excluded. For the purposes of allele frequencies 
comparison, we assigned two alleles for every homozygote call in our 
dataset; and Yates corrected chi-square was calculated for each 
distribution. The top 50 most prevalent missense SNPs among the 
17 samples are presented in Table 2. The highest difference in the 
allele distribution between our dataset and ESP was estimated for 



rs2305376 in the gene HOOK2, encoding a component of the FTS/ 
Hook/FHIP (FHF) complex that has a role in vesicle trafficking and 
maintenance of centrosome function and is known to interact with 
the /LWproto-oncogene 40,41 . Interestingly, the variant is predicted to 
be damaging change due to glycine to arginine substitution 
(p.GlOR), which in addition to its low prevalence in the ESP 
datasets, is rare to absent in the European population datasets (see 
Table 2). This variant was called in 3 of our 17 samples, and all of 
them were called homozygous due to the high abundance of variant 
over wild type reads. Another overrepresented SNP in our dataset 
was the missense substitution p.T573A in the protein tyrosine 
phosphatase PTPN12, whose activity is lost in a large proportion of 
breast cancer cases 42 . Of note, while PTPN12 loss is most strongly 
associated with the TNBC phenotype 43 , we found this variant equally 
prevalent in all three breast cancer subtypes; one TNBC and one 
HER2 -positive samples carried it in homozygote state. Other 
breast cancer implicated genes with prevalent variants amongst 
our samples were PLEC, PRCP, DSG2 and ERBB2IP, all harboring 
predicted to be damaging aminoacid changes 44 47 . Potential 
contribution of such variants to the phenotype in these patients is 
worth investigation. 

Deleterious protein mutations. A selected subgroup of potentially 
deleterious SNPs consisted of mutations predicted to generate 
premature stop (PMS) codons through either nonsense substitution 
or a splice-site aberration leading to a frame-shift due to out-of- frame 
exon skip or intron retention. In this group we also retained SNPs 
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removing a stop codon, because of their known severe biological 
consequences. A total of 1,593 variants of this type were called 
across our 17 samples, from which 77 were different nonsense 
SNPs and 16 were INDELs leading to a stop codon generation 
(Table 3). Among the stop-codon mutations, p.R93X in Steroid 
Receptor activator SRA1 was present in 10 of the 17 samples, and 
in 8 of them it was called homozygous. Given that SRA1 is involved 
in regulating the activity of steroid receptors and is deregulated in 
breast cancer 48 , the high expression rate of a nonsense variant might 
indicate functional implications in our breast cancer samples. In 
addition, two HER2 and two TNBC samples were positive for the 
p.Q281X in the zinc-activated ligand-gated ion channel (ZACN). 
Interestingly, the nucleotide change causing this mutation - chrl7: 
74077797 C > T - also resides in the 5'UTR of the gene encoding 
exocyst complex component (EXOC7), whose deregulated expression 
was reported to be a strong predictor for metastatic outcome in early 
stage TNBC 49 . The two TNBC samples positive for the variant did not 
show presence of reference reads in this position. When all the genes 
affected by a deleterious mutation were analyzed through Ingenuity 
Pathway Analysis (IP A), the top affected molecular networks were 
cell death and survival, cellular development, and cellular growth and 
proliferation, and the top affected canonical pathway was estrogen 
receptor signaling (Supplementary Figure 1). 

Novel expressed variations in breast cancer, and allele specific 
expression. The statistics on the filtered novel SNPs and INDELs 
are summarized in Supplementary Table 3; a complete list of the 
novel exonic annotated variations is available upon request. As 
expected, majority novel variants mapped within gene regions 
(70% of the SNPs and 66% INDELs). Filtering out of the intronic 
calls significantly reduced both SNP and INDEL numbers to between 
43 and 186 SNPs per sample (average 76) and between none and 17 
INDELs (average 8). Overall, 8% of the novel intergenic SNPs and 4% 
of the novel INDELs mapped within exons. Across the 17 samples, 
the total number of genes with coding and regulatory sequences 
affected by at least one novel SNP was 2103, and the genes with at 
least one novel INDEL were 566. A selected set of exonic variants 
were confirmed by Sanger sequencing (Figure 3). 

From the novel exonic SNPs, 285 unique SNPs were predicted to 
alter the protein sequence. Based on position and function, three of 
these SNPs were annotated to generate a novel stop codon, 1 14 were 
located within 2 bp of a splice junction, and 174 were missense, from 
which 70 were predicted to significantly affect the protein function. 
Six novel SNPs had dual annotation: missense substitutions located 
at a splice site. A total of 121 novel coding SNPs affected highly 
conservative nucleotide positions. Three novel coding SNPs - one 
missense and two synonymous substitutions - were called in two 
samples, and one - a stop codon in the solute carrier SLC30A8 
(p.Q28X) - in three different breast cancer samples (Table 4). 
Interestingly, p.Q28X affects only one (NM_1 7385 1.2) of the five 
protein coding SLC30A8 splice isoforms; this isoform possesses an 
alternative 5 ' -end and is present in all three samples from our set that 
expressed this isoform. The stop codon is located early in the protein 
chain and likely leads to complete abolishment of the protein 
expression. Since this SLC30A8 isoform was not expressed in the 
remaining 14 breast cancer samples, this early stop codon may indi- 
cate regulatory mechanism preventing the expression of this particu- 
lar isoform in breast tissue. 

To assess potential allele preferential expression, we analyzed the 
ratio of reference and variant reads at all coding positions for the 
novel SNPs called by 6 or more reads harboring the variant nucleo- 
tide (Figure 4). Fifty seven of these novel SNPs were called by variant 
reads only (i.e. no reference call was present at the corresponding 
position), and additional 53 showed higher than 5-fold number of 
variant calling reads over the wild type (Supplementary Table 4). 
Among the most preferentially expressed novel SNPs were missense 
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Table 3 | Deleterious mutations identified among the 1 7 Breast Cancer Samples 



Gene 


Sample ID 


# Samples 


Function 




Location 


REF allele 


Varallel 


Change 


Zygosity 


ABCA10 


56 


1 


stop-gained 


chrl 7:67149973 


G 


A 


p.R1322X 


homozygote 


C 1 7orf77 


IP2-76 


2 


stop-gained 


chrl 7:72588806 


C 


A 


p.C207X 


homozygote 


C 1 7orf77 


IP2-71 


2 


stop-gained 


chrl 7:72588806 


C 


A 


p.C207X 


homozygote 


C5orf20 


56 




stop-gained 


chr5 


134782450 


T 


A 


p.Rl 1 7X 


homozygote 


CARM1 


IP2-50 


! 


stop-gained 


chrl 9:1 1019790 


C 


A 


p.Y155X 


heterozygote 


CCDC25 


IP2-90 


1 


stop-gained 


chr8 


27610077 


C 


A 


p.E66X 


homozygote 


DCAF1 1 


IP2-50 


1 


stop-gained 


chrl 4:245924 13 


c 


A 


p.Y486X 


heterozygote 


EFHA1 


IP2-69 


1 


stop-gained 


chrl 3:22077082 


T 


A 


p.K306X 


heterozygote 


EGFL6 


1 71 


1 


stop-gained 


chrX: 13624542 


c 


T 


p.Rl 89X 


homozygote 


ERV3-1 


IP2-50 


1 


stop-gained 


chr7:64452738 


G 


A 


p.R223X 


homozygote 


EXOC7,ZACN 


56 


4 


stop-gained 


chrl 7:74077797 


C 


T 


p.Q281X 


heterozygote 


EXOC7,ZACN 


IP2-76 


4 


stop-gained 


chrl 7:74077797 


C 


T 


p.Q281X 


homozygote 


EXOC7,ZACN 


IP2-78 


4 


stop-gained 


chrl 7:74077797 


C 


T 


p.Q281X 


homozygote 


EXOC7,ZACN 


171 


4 


stop-gained 


chrl 7:74077797 


C 


T 


p.Q281X 


heterozygote 


FCGR2A 


IP2-65 


1 


stop-gained 


chrl 


161476204 


C 


T 


p.Q62X 


homozygote 


GAB4 


IP2-90 


1 


stop-gained 


chr22: 17469049 


c 


A 


p.G163X 


homozygote 


GET4 


IP2-83 


1 


stop-gained 


chr7 


93 1 966 


c 


A 


p.Y219X 


homozygote 


HNRNPR 


IP2-76 


1 


stop-gained 


chrl 


23637469 


G 


T 


p.Y460X 


heterozygote 


IL17RB 


IP2-83 


1 


stop-gained 


chr3 


53899276 


c 


T 


p.Q484X 


heterozygote 


LAIR2 


IP2-76 


1 


stop-gained 


chrl 9:5501 9261 


C 


T 


p.R76X 


heterozygote 


LOC 1 009964 


26 


1 


stop-gained 


chr6:57398270 


c 


T 


p.Q325X 


heterozygote 


MAD2L1 BP 


IP2-49 


1 


stop-gained 


chr6:43608124 


c 


T 


p.R227X 


heterozygote 


MADD 


IP2-83 


1 


stop-gained 


chrl 1:47306630 


c 


T 


p.R766X 


homozygote 


MAGEB16 


IP2-66 


1 


stop-gained 


chrX:35821 127 


c 


T 


p.R272X 


homozygote 


METAP1 


IP2-50 


1 


stop-gained 


chr4:99982427 


c 


T 


p.R374X 


heterozygote 


MTA2 


IP2-83 


1 


stop-gained 


chrl 


1:62364262 


G 


T 


p.Y243X 


homozygote 


NHLRC2 


IP2-76 


1 


stop-gained 


chrl0:l 15618327 


C 


A 


p.Y73X 


heterozygote 


PDE4DIP 


171 




stop-gained 


chrl 


144915561 


G 


A 


p.R622X 


heterozygote 


PDE4DIP 


IP2-65 


1 


stop-gained 


chrl 


145075683 


C 


T 


p.W60X 


homozygote 


PDE4DIP 


IP2-49 




stop-gained 


chrl 


144916676 


C 


T 


p.W560X 


heterozygote 


PDE4DIP 


26 




stop-gained 


chrl 


144916676 


C 


T 


p.W560X 


heterozygote 


PELI3 


IP2-50 


1 


stop-gained 


chrl 


1:66235714 


G 


T 


p.E39X 


homozygote 


PKD1 L2 


IP2-65 


1 


stop-gained 


chrl6:81242198 


G 


A 


p.Q220X 


homozygote 


PRB4 


IP2-49 




stop-gained 


chrl 2:1 1461802 


G 


A 


p.R39X 


homozygote 


PRM3 


IP2-76 




stop-gained 


chrl 6:1 1367143 


G 


A 


p.R104X 


homozygote 


RHBDD3 


IP2-49 


1 


stop-gained 


chr22:29656431 


C 


T 


p.W289X 


homozygote 


SKIV2L 


IP2-42 


1 


stop-gained 


chr6:3 1936654 


C 


T 


p.R1063X 


homozygote 


SYNE2 


26 




stop-gained 


chrl 4:64560092 


G 


A 


p.W4001X 


homozygote 


TMEM134 


IP2-42 




stop-gained 


chrl 


1:67235051 


G 


A 


p.R84X 


heterozygote 


VPS13B 


171 




stop-gained 


chr8 


100133706 


T 


G 


p.Y413X 


homozygote 


ZSWIM3 


171 


1 


stop-gained 


chr20:4450541 1 


G 


T 


p.E72X 


homozygote 


ANKS1A 


IP2-69 




INDEL 


chr6:34738008 


A 


AA 




homozygote 


ANKS1A 


IP2-42 




INDEL 


chr6:34902473 


G 


GT 




heterozygote 


CABIN 1 


171 




INDEL 


chr22:24455826 


G 


GAAAA 




homozygote 


CABIN 1 


83 




INDEL 


chr22:24448944 


T 


TT 




homozygote 


CANX 


IP2-69 




INDEL 


chr5 


179140762 


A 


AA 




homozygote 


CMYA5 


56 




INDEL 


chr5:78982956 


GCTT 


GCTTCTT 




homozygote 


EME1 


56 




INDEL 


chrl 7:48276005 


C 


CC 




homozygote 


LAMA3 


IP2-49 


1 


INDEL 


chrl 


3:21434967 


ATAAA 


A 




homozygote 


Mbo 1 A 


IDO A O 

lrz-4z 




Kinci 
INUtL 


chr4 


140619265 


T 
1 


TT 
1 1 




homozygote 


MRPS15 


56 




INDEL 


chrl 


36921785 


TA 


TGGAAAA 




homozygote 


SLC17A5 


IP2-78 




INDEL 


chr6:74351412 


AC 


ACACC 




homozygote 


SLC5A8 


IP2-66 




INDEL 


chrl 2: 101 550975 


CACA 


CACACA 




heterozygote 


SMARCA5 


171 




INDEL 


chr4 


144340520 


AAGAA 


AA 




heterozygote 


TRAPPC9 


IP2-83 




INDEL 


chr8 


141413543 


A 


AA 




homozygote 


ZNF100 


IP2-50 




INDEL 


chrl 9:21 908799 


CACA 


CA 




homozygote 


ZNFX1 


IP2-42 




INDEL 


chr20:47871283 


GACCCTTGGA 






homozygote 



variants in previously linked breast cancer genes, such as methyl- 
thioadenosine phosphorylase MTAP (p.K71R), and melanoma anti- 
gen MAGED1 (p.G87A) 50 . 

Studies revealing impaired interaction of splicing coordinator 
ESRP2 bearing a R353Q substitution. Among the novel and rare 
SNPs predicted to be protein-altering in our breast cancer samples, 
we selected to study the functional effect of the R353Q substitution 
in ESRP2, based on the established connection of ESRP2 to cancer 
through its role in epifhelial-to-mesenchymal transition (EMT) 51 ~ 53 . 



Arginine 353 is located in the second RNA recognition motif (RRM) 
domain of the ESRP2, which is known to interact with specific pre- 
mRNAs sequences. There are three RRM domains in ESRP2, and 
they are implicated in regulating the expression of specific splice 
variants of FGFR2, CTNND1 and ENAH that are involved in 
EMT. We applied site-directed mutagenesis to generate £SRP2 R353Q 
harboring expression vector and transfected MDA-MB-231 human 
breast cancer cells in parallel with expression constructs containing 
wild type ESPR2. After transfection, wild type and the mutant 
ESRP2 R353Q proteins were purified and compared for their ability 
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ESRP2 p.R353Q 
chrl6: 68265946 G>A 
Sample ID: IP2-42 Non-TNBC 





MAD2L1BP p.R227X 
chr6:43608127 OT 
Sample ID: IP2-48 NonTNBC 





B 



TPPl p.P320S 
chrll:6635782 OA 
Sample ID: 171 HER2 positive 





GLUD2 p.L314L 
chrX:120182480 A->G 
Sample ID: 83 HER-2 positive 




GBP1: p.R244H 
chrl:89523818 OT 
Sample ID: IP2-50 TNBC 




SLC30A8 p.Q28X 
chr8:118159203 C->T 
Sample ID: IP2-49 Non-TNBC 






Figure 3 | Sanger Sequencing validation of selected variants; IGV is also presented. (A) ESRP2, p.R353Q; (B) TPPl, p.P320S; (C) GBP1, p.R244H; (D) 
MAD2BP1, p.R227X. For TPPl, GBP1 and MAD both IGV and chromatogram show prevalence of the variant allele. 



to bind the FGFR2 pre-mRNA region through Electrophoretic 
Mobility Shift Assay (EMSA); the results are shown in Figure 5. 
We observed strong interactions between the wild type ESRP2 and 
the FGFR2 pre-mRNA as previously reported 52 . However, this 
interaction was significantly impaired for the mutant ESRP2 R353Q 
(compare lanes 6 and 7 with Lanes 2 and 3, Panels A-C), suggest- 
ing that the R353Q substitution compromises ESPR2 binding, and 
potentially, splice regulation of the FGFR2 pre-mRNA. This effect 
was observed in all three tested breast cancer cell lines: MCF-7 
(Figure 5A), MDA-MB-231 (Figure 5B), and BT-549 (Figure 5C). 
Further, in line with previous observations 52 , RT-PCR showed 
increased expression of the epithelial FGFR2 isoform Illb after 
transfection with wild type ESRP2 in the mesenchymal FGFR2 
Illc-expressing cell lines MDA-MB-231 and BT-549 (Figure 5D). 
This increase in FGFR2 Illb expression was lower (BT-549) to 
completely abolished (MDA-MB-231) after the transfection with 
the mutant ESRP2 R353Q (Figure 5D). 

Discussion 

Here we present the first mRNA sequencing based study that reports 
expressed variations in TNBC, Non-TNBC and HER2-positive 
breast cancer transcriptome. Several molecular mechanisms, such 



as RNA editing and allele preferential expression, could cause a 
discrepancy between the variations found at mRNA and DNA levels. 
Compared to exome and genome sequencing, RNA-seq provides 
essential insights into the functionality of the variants through 
estimation of the absolute and relative abundance of variant reads 
and the co-existence or mutual exclusion of variations, expression 
and splicing patterns. In addition to outlining the general landscape 
of the breast cancer variation transcriptome, our study reports novel 
variants in an allele-specific expression context, aligns our findings 
with the existing knowledge on breast cancer genetics, and exempli- 
fies efficient extraction of information from the transcriptome 
through extensive analyses. 

Mutations in previously associated breast cancer genes, BRCA1, 
BRCA2 and ATM, were called in 9/17 (53%) samples which is a 
higher than the previously reported mutation prevalence among 
breast cancer patients 5, 12 - 16 ' 18 - 20 . While only one patient was a carrier 
of known pathological variants in both BRCA1 and ATM, five other 
individuals carried missense substitutions in at least two different 
breast cancer associated genes (see Table 1). Whether the disease in 
these patients could be contributed to cumulative impaired function- 
ing of these genes is a subject of further investigation; nevertheless, 
the relatively frequent co-occurrence of protein altering variations in 
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Table 4 | Novel Exonic Variants that are seen in two and three out of the 1 7 breast Cancer Samples 

Gene Chromosomal Location cDNA Annotation Protein Annotation Function Samples Cancer subtype 

SLC30A8 chr8:l 18159203 C>T c.82 C > T p.Q28X stop-codon IP2-48 Non-TNBC 

IP2-49 Non-TNBC 

IP2-66 Non-TNBC 

ACL chrl:l 00387140 G>T c.4484 G > T p.C1495F missense IP2-49 Non-TNBC 

IP2-66 Non-TNBC 

GLUD2 chrX: 120 182480 A> G c.942 A > G p.L314L synonymous IP2-50 TNBC 

83 TNBC 

GPN1 chr2:27873001 A > G c.H01A>G p.E367E synonymous IP2-49 Non-TNBC 

IP2-66 Non-TNBC 



known breast cancer-associated genes in different cancers raises the 
necessity to examine larger series of patients and controls for com- 
binatorial genetic risk. An interesting observation is the high preval- 
ence of homozygote vs. heterozygote calls in BRCA1, BRCA2 and 
ATM for both breast cancer-associated genes, and those not known 
to be pathogenic variants, suggesting potential allelic loss in those 
genes. 

Among the essential findings of our study is a subset of novel SNPs 
and INDELs, some affecting genes previously implicated in breast 
cancer, in which, however, no predisposing or causative point var- 
iants have been reported so far. An example is p.K71R in MTAP, 
frequently seen co-deleted with the CDKN2A and CDKN2B tumor 
suppressor genes in a large cohort of 2000 breast tumors 50 . While 
the biological significance of p.K71R in MTAP and other novel 



variations in cancer-associated genes is currently unclear, overex- 
pression of novel variant over reference alleles points to a possible 
contribution to tumor initiation or progression. Since these variants 
have not been previously reported, they are not likely to be present in 
a homozygote state at the genomic level, and their allelic dominance 
may indicate expression or growth advantage, as well as potential loss 
of heterozygosity. Because such events may drive or contribute to 
cancer, a systematic investigation of allelic dominance of novel var- 
iants across larger expression sets is needed. 

In addition, we also identified a higher frequency (compared to 
non-breast cancer populations) of previously reported variants in 
many genes, including breast cancer-associated genes, such as 
PTPN12, PRCP, PLEC, DSG2 and ERBB2IP. Estimation of the pre- 
valence of such variants in larger breast cancer cohorts is needed as it 
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Figure 4 | Allele preferentially expressed novel missense variants through estimation of the ratio of reference and variant reads. The Variant-to wild 
type allele ration was estimated for all the novel SNPs called by 6 or more reads harboring the variant nucleotide. Fifty seven novel SNPs were called by 
variant reads only (i.e. no reference call was present at the corresponding position), and additional 53 showed higher than 5-fold number of variant 
calling reads over the wild type. Among the most preferentially expressed novel SNPs were novel missense variants in previously linked to breast cancer 
genes such as methyl-thioadenosine phosphorylase MTAP (p.K71R), and melanoma antigen MAGED1 (p.G87A). 
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Figure 5 | EMSA of ESRP2 interaction with FGFR2 ISE/ISS-3 cis-regulatory motif pre-RNA. The R353Q mutation in ESRP2 compromises FGFR2-IIIb 
expression. Vector (PIBX-CFF-B), ESRP2™' or ESRP2 R353Q were transiently transfected in breast cancer cell lines: MCF-7 (A), MDA-MB-23 1 (B) and BT- 
549 (C). RNA binding of ESRP2™ 1 or ESRP2 R353Q is shown. The incubated samples were resolved on 6% native-PAGE gel and detected by Phosphor 
imager. D) R353Q mutation in ESRP2 compromises FGFR2-IIIb expression. Vector (PIBX-CFF-B), ESRP2-Wt or ESRP2-Mut (R353Q) were transiently 
transfected in mesenchymal breast cancer cell lines, MDA-MB-23 1 and BT-549. FGFR-IIIb was detected by RT-PCR. The bands were quantified and 
normalized by the actin band intensities. 



may indicate contribution to genetic risk or co-existence with cau- 
sative mutations. Although this analysis holds promising potential to 
identify overrepresented alleles, it is important to take into account 
that transcriptome sequencing variant calls differs from the exome 
sequencing in allelic representation of homo- and heterozygote state 
(i.e. number of alleles). While homo vs. heterozygosity on transcrip- 
tome level provides an additional layer of information on the poten- 
tial functionality of these variants, the results should be used only 
after confirmation by independent studies. Nevertheless, statistical 
confidence may be increased for SNPs in which the difference is 
achieved through the analysis of high number of samples rather than 
homozygote appearance, such as PTPN12 and DSG2 (see Table 2). 
Such prevalent variants in genes implicated in breast cancer are 
worthy of investigation in independent breast cancer datasets. 

Similarly to the above discussed prevalence of mutant reads in 
breast cancer-associated variants, GWAS associated SNPs in our 
set also showed high prevalence of homozygous vs. heterozygous 
calls. This overall prevalence of variant over reference reads for vari- 
ant positions in cancer implicated genes, needs to be further inves- 
tigated as potential indicator of mechanistic implications, such as 
loss of heterozygosity or preferential allelic expression. As the 
information content of the transcriptome as a common denominator 
combining frequency and expression data is emerging, large scale 
studies are expected to enlighten the feasibility and the information 
value of these types of analyses 32 . 

Finally, we selected a rare, predicted to be protein damaging mis- 
sense substitution from our dataset - p.R353Q in the splicing coor- 
dinator ESRP2 - to demonstrate in vitro the effect of the p.R353Q 
substitution on the ESRP2 protein function. We were able to show 
that the replacement of the arginine 353 with the variant glutamine 
leads to a significant reduction of the binding ability of ESRP2 to 



FGFR2 pre-mRNA. Thus, this could potentially affect epithelial-to- 
mesenchymal transition programs. 

Overall, our analysis identified enrichment of variants known to 
be implicated in breast cancer as well as novel and rare variants in 
genes associated with breast cancer in our set of 17 breast cancer 
samples. Further, the within-individual exploration of the variance 
showed multiple disease associating variants in most of the indivi- 
duals, and points to the need for estimation of cumulative action of 
genetic alterations. This study reports an initial collection of variants 
that are expressed across the breast cancer transcriptome, including 
novel and reported mutations in their allelic abundance and co- 
presence with other variants. In addition to providing an overall 
variation landscape of the breast cancer transcriptome, such as 
expressed SNP density and deleterious variants scaffold, we exem- 
plify different analytical strategies to search for molecular interac- 
tions and regulatory networks potentially implicated in breast 
tumorigenesis. Compared to exome and genome studies, transcrip- 
tome exploration provides higher information content through the 
estimation of the expression abundance, in the immediate context of 
allelic prevalence and co-existence with expression and splicing 
features 54,55 . It is essential to keep in mind however that the tran- 
scriptome only captures a snap shot and further functional char- 
acterization of the observed molecular features is needed to prove 
disease-causative relationships. Nevertheless, our study provides an 
important breast cancer transcriptome dataset for further explora- 
tions on either high-throughput or individual gene/protein scale. 

Methods 

Human patient samples. The human breast cancer tissue RNA samples were 
provided by Dr. Suzanne Fuqua (Baylor College of Medicine). All of the human 
samples were used in accordance with the IRB procedures of Baylor College of 
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Medicine and Dana-Farber Cancer Institute and Harvard Medical School, 
respectively. The breast tumor types, TNBC, Non-TNBC and HER2-positive, were 
classified on the basis of RNA sequencing FPKM abundance and 
immunohistochemical and qRT-PCR classification (data not shown) as previously 
described 5455 . All breast cancer patients were from European descent. 

Illumina genome sequencing RNA sequencing library preparation. Large and 
small ribosomal RNA (rRNA) was removed from total RNA using RiboMinus 
Eukaryote Kit {Invitrogen, Carlsbad, CA). Five micrograms of total RNA were 
hybridized to rRNA-specific biotin labeled probes at 70 C for 5 minutes. The rRNA- 
probe complexes were then removed by streptavidin- coated magnetic beads. The 
rRNA-free transcriptome RNA was concentrated by ethanol precipitation. The 
cDNA synthesis and DNA library construction for all the seventeen samples were 
performed as described 54 ' 55 . 

Read alignment and transcript assembly. The paired end raw reads were aligned 
using the TopHat version 1.2.0 that allows two mismatches in the alignment. The 
aligned reads were assembled into transcripts using cufflinks version 2.0.0. The 
alignment quality and distribution of the reads were estimated using SAM tools. From 
the aligned reads, de novo transcript assembly was performed to capture the major 
splice rearrangements and novel variations that occur in the transcriptomes of TNBC, 
Non-TNBC and HER2- positive breast cancers in comparison to NBT using cufflinks 
version 1.3.036. The cuffcompare program was used to identify transcripts that are 
identical to the reference human genome (the Ensembl GRCh37.62 B (hgl9) 
reference genome). Further analysis and novel isoform call was performed through 
the reconstructed transfrags that comprise novel splice junctions and share at least 
one splice junction with a reference transcript. The very low abundant transcripts 
were identified by binning the transcripts according to their FPKM and the 
transcripts with FPKM below 0.3 were eliminated from further analysis. All the 
analyses presented in this manuscript are performed using two categories of 
transcripts: transcripts that are identical to reference and transcripts that comprise 
novel junctions. The global statistics, which includes the distributions of FPKM scores 
across samples and the dendogram that shows the relationship between the samples 
based on the reconstructed transcripts, were analyzed using cummeRbund package of 
cufflinks suite of programs. The average exon number was in the reassembled 
transcripts is comparable to the human genome reference average. To annotate novel 
splice events, we used Multivariate Analysis of Transcript Splicing (MATS). 
Additionally, for consistency checking and independent validation we used an in- 
house built program (http://ccb.jhu.edu/software/ASprofile/) to compare the exon 
models between isoforms assembled with the program cufflinks for the normal and 
cancer samples. As mentioned earlier, only the isoforms that are similar to reference 
and isoforms that comprise novel splice junctions were considered. We determined 
the splicing differences indicative of exon inclusion, exclusion, alternative 5 ' , 3 ' , and 
intron retention events. 

Variants call and annotation. Variants calls were obtained using Mpileup utility of 
SAMTools (http://samtools.sourceforge.net/mpileup.shtml). Base Alignment 
Quality was used to score the variant call. Consensus calling is done using bcftools. 
Maximum depth call was set at 100000. The variants were annotated using SeattleSeq 
Annotation Tools version 8.01, dbSNP build 137 (http://snp.gs.washington.edu/ 
SeattleSeqAnnotation 137/). 

Sanger sequencing. First- strand cDNA was synthesized with Superscript III reverse 
transcriptase (Invitrogen, Inc) using 1 //g of total RNA and mixture of oligo dT 
primer and random hexamers. For selected variants, cDNA primers flanking the 
variant position were designed using Primer3 56 and in RT-PCR to amplify the region 
of interest. The products were separated on 1% agarose gel, excited and purified using 
QIAquick Gel Extraction Kit (Qiagen, Inc.) according to the manufacturer 
instructions. The purified fragments were subjected to bi-directional Sanger 
sequencing with the forward and the reverse primer used for the amplification. 

Statistics. To test if the distribution of variant alleles differed between our group and 
non-breast cancer populations, we applied chi-square test (2X2 tables). All the 
values were subjected to Yates correction for contingency to prevent overestimation 
of significance; p values below 0.05 were considered significant. 

EMSA. To determine if the R353Q substitution affects the ability of ESRP2 to bind its 
substrate we used wild type and mutant FLAG-tagged ESRP2 ORE introduced in 
PBIX as previously described 52 . Three cell lines MCF-7, MDA-231 and BT-549 were 
transfected cell lines using FuGENE® Transfection Reagents (Promega, Inc.) 
according to the manufacturer recommendations. Nuclear extracts were prepared 
using a Nonidet P-40 lysis method. RNA oligos of ISE/ISS-3 were end labeled with 
using the annealed [y- 32 P] ATP in a 20 [A reaction mixture for 15 min at room 
temperature. RNA probes were incubated with respective nuclear extracts. Samples 
were run on a non- denaturing 6% polyacrylamide gel and imaged by 
autoradiography. Specific competitions were performed by adding a 100-molar 
excess of unlabeled probe to the incubation mixture and supershift Electrophoretic 
mobility shift assay (EMSA) were performed using FLAG antibody (Sigma- Aldrich). 
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